I. INTRODUCTION


Greater understanding of the theory of digital signal processing,
coupled with dramatic advances in hardware technology and software
engineering, has led to improved remote sensing capabilities for military
and civilian applications. Nowhere has the impact of theoretical concepts
and hardware/software technological advances been felt more than in
the field of radar remote sensing. This is due to a large extent to the
inherent difficulty of the basic radar signal processing problem.

Radar signal processing differs from other signal processing problems
in that very high data throughput rates as well as wide dynamic
ranges are often simultaneously required. This unfortunate coincidence
results directly from the physics of the remote sensing problem, where
noncooperative (evasive) target detection using primary radar is hampered
by partially correlated noise in the form of ground, weather, and chaff
clutter. As a consequence, radar signal processors must be fast (capable
of high data throughput rates) and intelligent (capable of executing
algorithms which can distinguish between correlated noise and targets
of interest). The challenge, then, is to define a radar signal processor
which can execute the resulting complex algorithms in real-time with
arithmetic precision adequate to allow differentiation between small
cross-section targets and large cross-section clutter.

A.  Study  Objectives 

The objective of this study is to investigate the use of
microprocessors and other currently available large scale integrated (LSI)
circuitry for radar signal processing and to define a structure which is
capable of executing algorithms in real-time. A processing throughput
objective is provided by the Advanced Sensors Directorate's Quiet Radar
parameters [1]. The general processing requirements of this radar serve
as a baseline for the present study. While the study is theoretical in
nature, the use of these real-world radar parameters tends to anchor the
results in a context which can be meaningful in the near-term. In
particular, it is envisioned that this study will serve as the basis for the
later design and fabrication of a high speed, flexible radar signal
processor with a broad range of applications in remote sensing.

A great  deal  of  attention  has  been  focused  in  recent  years  on  the 
application  of  microprocessors  to  various  data  processing  and  control 
applications.  More  recently,  advances  in  microprocessor  technologies 
have  presented  apparent  opportunities  for  increased  data  throughput  and 
processing  flexibility.  As  a consequence,  various  processors  designed 
especially for high throughput have been proposed [2-6]. The present
task  has  considered  the  application  of  microprocessors  as  well  as  other 
state-of-the-art  LSI  potentially  suitable  for  use  in  high-speed  signal 
processing  applications. 
B. Study Scope

The scope of this work encompasses the basic elements of
a radar signal processor architecture analysis at the block diagram level.
Tradeoffs are made between candidate approaches as they apply to remote
sensing in general and to the Advanced Sensors Directorate's Quiet Radar
in particular. It is envisioned that the further reduction of these system
level descriptions to fundamental logic circuit diagrams will be
accomplished in a related follow-on effort.

C.  Study  Approach 

The  approach  taken  in  this  study  was  to  investigate  signal 
processor  hardware  architectures  to  determine  their  applicability  to  the 
radar  processing  problem.  A basic  assumption  throughout  this  study  was 
that  existing  state-of-the-art  LSI  including  microprocessors  and  special 
purpose  LSI  hardware  would  be  used  as  the  processor  building  blocks. 

Analysis  of  candidate  architectures  was  carried  on  in  light  of  the 
baseline  signal  processing  throughput  requirements  of  the  Advanced  Sensors 
Directorate's  Quiet  Radar  program  but  was  not  restricted  to  consideration 
of these parameters only. Consideration was also given to modular
architectures which offer flexibility in terms of expansion through replication
of constituent components. Such architectures have certain cost advantages
as  well  as  robustness  in  terms  of  hardware  and  software  reliability  and 
maintainability. 

The  following  discussion  presents  the  study  results  in  a top-down 
fashion.  Candidate  classes  of  signal  processor  architectures  are  first 
discussed.  Desirable  attributes  as  well  as  shortcomings  of  microprocessor- 
based  signal  processors  are  then  considered  in  relation  to  the  high-speed 
radar signal processing problem. Succeeding sections relate candidate
processor architectures to the baseline radar parameters of interest.
Finally, a specific radar signal processor architecture is proposed as
a candidate for later detailed design and fabrication.

II.  CANDIDATE  CLASSES  OF  SIGNAL  PROCESSOR  ARCHITECTURES 

It  is  desirable  to  define  a radar  signal  processor  architecture 
which  achieves  maximum  data  throughput  and  flexibility  with  a minimum 
investment  in  hardware  and  software.  Candidate  processing  elements  to  be 
used  in  this  study  are  provided  by  the  families  of  LSI  circuits  presently 
available.  Of  these  classes,  the  following  have  been  considered: 
a) Eight-bit metal oxide semiconductor (MOS) microprocessors.

b) Four-bit slice microprocessors.

c) Eight-bit microprocessors + special purpose arithmetic unit.

d) Special purpose arithmetic hardware.


The performance-related characteristics of devices of this nature are
as follows:

a) Instruction cycle time.

b) Data word width.

c) Instruction set.

d) Small scale integration/medium scale integration (SSI/MSI)
overhead required.

e) Bus structure.

f) Input/Output (I/O) capabilities.

The performance characteristics were considered for each of the previously
listed classes of microprocessor architectures. The following sections
consider three specific classes of processor architectures.

A. Single Microprocessor Architectures

Previous works on multiprocessing systems have defined
single central processing unit (CPU) computers as "Single-Instruction
Single-Data (SISD)" machines [7]. The majority of the general purpose
computers presently in use are SISD architectures. SISD architectures use
a single control unit to route data into and out of the CPU. As a result,
only one arithmetic process such as addition, subtraction, multiplication,
or division can occur at one time. Furthermore, the movement of data is
usually accomplished by means of a single data bus in such designs.

The primary advantage of single-instruction single-data architectures
is the simplicity of the hardware and software structures. These machines
require little in the way of hardware and are straightforward to program.
Unfortunately, these attributes are achieved at the expense of data
throughput and flexibility as demonstrated in later sections of this
report. However, the SISD class processor remains important because it
is the primary building block of more complex processors.

1. Non-Bit Slice Processors

A microprocessor which illustrates the SISD architecture
is the Intel 8080A, an 8-bit machine. For purposes of this study, the
8080A has been chosen as the baseline architecture to which other designs
may be compared. Specifically, the 8080A was chosen for the following
reasons:
a) It is the most widely used microprocessor.

b) It is low-cost and readily available.

c) It is reasonable to consider an array of such
machines in more complex architectures.

Unfortunately,  the  single  data  bus  structure  permits  only  limited  data 
flow. Obvious variations of this basic structure therefore include
multiple  operational  and  resultant  buses  to  permit  increased  flexibility 
in  terms  of  data  management  alternatives. 

2.  Bit-Slice  Processors 

Attempts to achieve higher data throughput rates with
programmable LSI have resulted in the creation of bit-slice microprocessor
devices. The primary advantage of these devices is their faster
instruction cycle time. Their major disadvantages include increased part counts
due to support circuit requirements and the fact that they are generally
harder to program. Bit-slice microprocessors have been used in two
primary application areas as follows:

a) As instruction set emulators where microprogrammed
bit-slice machines are made to look like other
processors.

b) In moderate speed signal processing applications
where advantage can be taken of their micro-instruction
power and faster instruction execution time.

Bit-slice processors using the Motorola M10800 have been designed
by Motorola [8], Raytheon [9], and others. These machines have
instruction cycle times on the order of 100 nsec.

Another bit-slice microprocessor is the Advanced Micro Devices (AMD)
AM2900, a 4-bit-slice device. This processor uses Schottky bipolar LSI
technology and has an instruction cycle time of approximately 250 nsec.
Although the AM2900 cycle time is slower than that of the M10800, the
AM2900 is generally easier to program because it is a 2-bus structure while
the M10800 is a 3-bus design. The 3-bus design, in turn, makes more options
and potentially more powerful instructions available with the M10800.

At  the  present  time,  greater  software  support  is  available  with  the 
AM2900  including  a cross-assembler.  However,  Motorola  is  in  the  process 
of  developing  a software  support  package  for  the  M10800  which  should  be 
available by the end of the calendar year.*


*Balph, Tom, Motorola, Inc., Phoenix, Arizona, September 1977 (Private
Communication).
Finally, a third bit-slice microprocessor has recently become
available. This machine differs from the others in that it is an
8-bit-slice device. The part has been introduced by RCA and is known as the
ATMAC microprocessor. Unfortunately, this device is not available as a
commercial component but may be purchased only from RCA as a part of a
system. On the positive side, RCA does have extensive documentation and
software support available [10-12]. The ATMAC is attractive to the
signal processor designer for the following reasons:

a) It combines low power with high speed by using complementary
metal oxide semiconductor/silicon-on-sapphire (CMOS/SOS)
technology to give a very low speed-power product.

b) It is an 8-bit-slice (versus a 4-bit-slice) device and therefore
requires fewer components to realize a total system.

c) It has provisions for a peripheral special function unit (SFU)
which can be a high-speed multiplier device, for example.

Unfortunately, the ATMAC 8-bit-slice microprocessor data throughput
is limited by the incorporation of only a single bus I/O structure. This
appears to be its greatest architectural weakness. Discussions with
RCA personnel generally confirm this limitation.*

B. Multi-Microprocessor Architectures

Arrays  of  low-cost  microprocessors  performing  multiple 
computational  tasks  in  parallel  have  been  considered  as  an  alternative 
for  achieving  higher  data  throughput  rates  in  radar  signal  processing 
applications.  The  obvious  advantage  of  such  an  approach  is  the  redundancy 
inherent  in  such  a design  which  can  lead  to  a more  survivable  processor 
in  case  of  component  failures.  The  obvious  disadvantage  of  this  approach 
is  the  difficulty  in  programming  such  a structure. 

Arrays of 8-bit MOS microprocessors are potentially more attractive
than arrays of 4-bit-slice bipolar designs because of the lower parts
count and generally lower cost of the 8-bit processors. Unfortunately,
the 8-bit devices have a single I/O bus while the 4-bit parts have one
or two operand buses and a resultant bus.

C. Task-Allocated Processor Structures

In  radar  signal  processing,  various  tasks  of  differing 
complexity  and  speed  must  be  performed.  One  possible  way  to  arrive  at 
a radar  signal  processor  architecture  is  to  determine  the  needs  of  each 
processing  subtask  and  to  create  the  required  computational  resources 
necessary  to  accomplish  those  tasks.  This  approach  will  be  referred  to 
in  this  discussion  as  "task-allocated"  signal  processing. 


*Helbig,  Walter,  RCA  Advanced  Technology  Laboratories,  Camden,  New 
Jersey,  August  1977  (Private  Communication). 


At the crux of the radar digital signal processing problem are the
high data throughput requirements. Therefore, a logical place to begin
in defining a task-allocated structure is with those subtasks which have
the most demanding throughput requirements. In most coherent radars,
filtering operations require the greatest number of arithmetic operations
in the shortest time interval. These operations generally consist of:

1) High-pass filtering (Moving Target Indication, MTI).

2) Low-pass filtering (clutter maps).

3) Band-pass filtering (Doppler filtering).

The high-pass and low-pass filters may be synthesized using well-known
finite impulse response (FIR) and infinite impulse response (IIR)
techniques [13-15]. The band-pass filters required for Doppler processing
are usually realized with the fast Fourier transform (FFT) algorithm
[14, 15].

The top-down approach taken in this study was to determine how fast
various microprocessor and special purpose hardware structures could
perform the computations necessary to accomplish these filtering
operations. Filter order (weights) and transform length were used as
parameters of interest. A priori knowledge of the Quiet Radar performance
requirements was used to determine approximate filter orders and
transform lengths of interest. Arithmetic precision was initially assumed to
be 16 bits. It is expected that further work will be performed to
determine the validity of this assumption.

The first task to be considered is that of MTI filtering. Such
filters may be realized as conventional N-pulse cancellers, which
require that

$$y_n = a_0 X_n - \sum_{i=1}^{N-1} a_i X_{n-i} \tag{1}$$

where $X_n$ is the nth input data word. In the simplest case, the
coefficients, $a_i$, are unity. Thus, the simplest two-pulse canceller
requires only a single subtraction for each return pulse.
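
As an illustration, the canceller can be sketched in a few lines of Python. This is a minimal sketch, assuming the output is the weighted current return minus the weighted previous returns and that the history before the first pulse is zero; the function name `pulse_canceller` and the sample echoes are hypothetical.

```python
def pulse_canceller(returns, coeffs=(1.0, 1.0)):
    """N-pulse canceller: y[n] = a0*x[n] - sum_{i>=1} a_i*x[n-i].

    `returns` holds successive echoes from one range cell.
    History before the first pulse is taken as zero (an assumption).
    """
    out = []
    for n in range(len(returns)):
        y = coeffs[0] * returns[n]
        for i in range(1, len(coeffs)):
            if n - i >= 0:
                y -= coeffs[i] * returns[n - i]
        out.append(y)
    return out


# Stationary clutter (constant echo) with one stronger echo from a target.
echoes = [5.0, 5.0, 5.0, 8.0, 5.0]
cancelled = pulse_canceller(echoes)  # default is the two-pulse canceller
```

With unit coefficients each output costs one subtraction, and the constant clutter level cancels to zero after the first pulse while the change caused by the target survives.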

More sophisticated MTI filters may be realized using FIR synthesis
techniques. Furthermore, FIR designs can be high-pass, low-pass, or
band-pass, depending upon the coefficients selected. Figure 1 presents the
computational requirements of an FIR design. The structure depicted
in this figure computes

$$y_n = \sum_{i=0}^{N-1} a_i X_{n-i} \tag{2}$$

where N is the number of filter weights. The structure depicted in
Figure 1(a) may be redrawn as shown in Figure 1(b). The usefulness of this
representation will be shown later in this discussion. The coefficients
for these representations can be determined using well-documented design
techniques [15, 16].
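
The FIR computation can be sketched as a direct-form weighted sum. This is a minimal sketch assuming the standard convolution-sum definition of an N-weight FIR filter with zero initial history; the name `fir_filter` is hypothetical and the weights in the example are arbitrary, not taken from the study.

```python
def fir_filter(samples, weights):
    """Direct-form FIR: y[n] = sum over i of weights[i] * samples[n - i].

    Each output needs len(weights) multiplies and len(weights) - 1 adds.
    Samples before the start of the record are taken as zero (assumed).
    """
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for i, a in enumerate(weights):
            if n - i >= 0:
                acc += a * samples[n - i]
        out.append(acc)
    return out


# A unit impulse reproduces the weights at the output.
impulse_response = fir_filter([1.0, 0.0, 0.0, 0.0], [1.0, -2.0, 1.0])
```

The inner loop makes the cost per output sample explicit: one multiply per weight, which is why the number of weights serves as the benchmark parameter.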


The second filter realization technique to be considered is the
IIR filter. The recursive nature of this design is clearly illustrated
in Figure 2. Basically, this structure computes

$$y_n = X_n + a_1 X_{n-1} + a_2 X_{n-2} - b_1 y_{n-1} - b_2 y_{n-2} \tag{3}$$

The coefficients for this second-order filter section may be determined
from existing IIR filter design programs [17].
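
A second-order recursive section of this kind can be sketched as follows. This is an illustrative sketch only: it assumes the usual four-coefficient form (two feed-forward terms a1, a2 and two feedback terms b1, b2) with zero initial state, and the name `iir_section` is hypothetical.

```python
def iir_section(samples, a1, a2, b1, b2):
    """Second-order recursive section:

    y[n] = x[n] + a1*x[n-1] + a2*x[n-2] - b1*y[n-1] - b2*y[n-2]

    Four multiplies per output sample; initial state is zero (assumed).
    """
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = x + a1 * x1 + a2 * x2 - b1 * y1 - b2 * y2
        out.append(y)
        x2, x1 = x1, x  # shift the input delay line
        y2, y1 = y1, y  # shift the output delay line
    return out


# With only b1 = -0.5 active the impulse response decays geometrically.
decay = iir_section([1.0, 0.0, 0.0, 0.0], 0.0, 0.0, -0.5, 0.0)
```

Because each output feeds back into later outputs, rounding errors in the coefficients and in the state accumulate, which is the source of the word-size concern raised for recursive designs.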

The recursive filter has the advantage of requiring lower orders
to achieve sharper cutoff responses than the FIR designs. Its major
disadvantages are its nonlinear phase response and the generally greater
coefficient word sizes necessary to ensure stability (i.e., to minimize
limit cycle and overflow oscillations).

As in the case of the FIR filter, the IIR representation shown in
Figure 2(a) may be redrawn as illustrated in Figure 2(b). Again, the
usefulness of this exercise will become apparent in the later discussion.

The third filter to be considered is the band-pass filter.
Band-pass filters for radar Doppler processing are frequently realized using
the FFT algorithm. The heart of the FFT process is the basic
computational element depicted in Figure 3(a). This structure solves the
following equations:

$$\begin{aligned}
a' &= a + c\cos\theta - d\sin\theta \\
b' &= b + d\cos\theta + c\sin\theta \\
c' &= a - c\cos\theta + d\sin\theta \\
d' &= b - d\cos\theta - c\sin\theta
\end{aligned} \tag{4}$$

As in the case of the FIR and IIR structures, the FFT elemental
computations may be redrawn as shown in Figure 3(b).
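
The FFT computational element can be checked against ordinary complex arithmetic. The sketch below assumes the element combines the complex points a + jb and c + jd with a rotation by the angle theta, so that the fourth output is b - d cos(theta) - c sin(theta); the function name `butterfly` is hypothetical.

```python
import cmath
import math


def butterfly(a, b, c, d, theta):
    """FFT computational element on (a + jb) and (c + jd).

    Returns (a', b', c', d') where
      a' + j b' = (a + jb) + (c + jd) * exp(j*theta)
      c' + j d' = (a + jb) - (c + jd) * exp(j*theta)
    Four real multiplies are needed, plus adds and subtracts.
    """
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    re = c * cos_t - d * sin_t  # real part of the rotated point
    im = d * cos_t + c * sin_t  # imaginary part of the rotated point
    return a + re, b + im, a - re, b - im


# Cross-check against Python's complex arithmetic for an arbitrary angle.
theta = 0.7
rotated = complex(3.0, 4.0) * cmath.exp(1j * theta)
upper = complex(1.0, 2.0) + rotated
lower = complex(1.0, 2.0) - rotated
a_p, b_p, c_p, d_p = butterfly(1.0, 2.0, 3.0, 4.0, theta)
```

The four products (c cos, d sin, d cos, c sin) are the same four multiplies that reappear when the element is compared with the filter structures.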

The  redrawn  filter  computational  structures  depicted  in  Figures 
1(b),  2(b),  and  3(b)  can  now  be  compared.  The  similarities  are  quite 
apparent.  These  structures  represent  the  most  computationally  demanding 
algorithms  found  in  coherent  radar  processing.  The  properties  evident 
here  can  be  exploited  to  serve  as  a rational  basis  for  a high-speed  radar 
signal  processor  design  as  shown  in  the  following  discussion. 


The arithmetic elements common to the redrawn FIR, IIR, and FFT
structures are four multipliers and several adders. Therefore, taking
the union of the three configurations shown in Figures 1(b), 2(b), and
3(b) and minimizing the functional arithmetic elements results in a basic
signal processing structure which can accommodate the high throughput
algorithms required by coherent radars. Two important questions are
(1) how to route the data efficiently to and from the high-speed
computational elements, and (2) how to store and buffer I/O and partial result
data effectively.

The  approach  taken  to  data  routing  in  the  proposed  special  purpose 
processor  is  to  use  multiple  parallel  data  buses  to  avoid  the  common 
problem  of  bus-limited  data  transfers.  Such  an  approach  has  the  potential 
for  achieving  100%  computational  efficiency  by  supplying  data  continuously 
to  the  arithmetic  computational  elements. 

The problem of special purpose processor data storage can be met with
high-speed, distributed memory capable of accepting data in parallel from
multiple buses. An important advantage of distributed memory, in addition
to having multiple I/O ports available to accommodate pipelined data, is
the ability to do data steering (switching) by clever memory addressing
techniques. If the special purpose processor arithmetic unit is envisioned
as a miniswitching system, the data storage elements can be used in much
the same way as in large electronic switching systems such as the Bell
System's Electronic Switching Systems (ESS) machine.

The final important concept to be discussed briefly in addition to
arithmetic, data-bus, I/O, and memory is that of special purpose
processor control. To achieve maximum flexibility and to guarantee that the
resulting structure will compute the FIR, IIR, FFT, and other signal
processing algorithms efficiently, it is proposed that the basic control
be microprogrammable. In a signal processing structure such as that
proposed here, the microprogram object code may be thought of as data
switch enables/disables. Thus, the data steering function is controlled
by the processor microprogram. The microprogram itself can reside either
in Read-Only-Memory (ROM), assuming that all processing algorithms to be
executed are known a priori, or it can reside in Random Access Memory
(RAM) which can be loaded by a more intelligent machine such as a
microprocessor or minicomputer.

The union of all the ideas briefly outlined in the preceding
paragraphs has been incorporated in a proposed radar signal processing
structure which is illustrated in Figure 4. It is proposed that the
special purpose arithmetic unit be constructed of computational and
memory components which will permit a 200-nsec pipelined throughput.

Under the assumption that such a processor can be constructed, the
resulting FIR, IIR, and FFT data throughput rates as a function of filter
order and FFT transform length can be determined.
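
A rough feel for these rates can be had from an idealized timing model. The sketch below is not taken from the study: it assumes that each of the four multipliers completes one product per 200-nsec cycle, that an N-weight FIR output therefore needs ceil(N/4) cycles, and that one radix-2 FFT butterfly completes per cycle. Memory, bus, and control overhead are ignored, and the function names are hypothetical.

```python
import math

CYCLE_S = 200e-9   # pipelined throughput proposed for the arithmetic unit
MULTIPLIERS = 4    # multipliers common to the FIR, IIR, and FFT structures


def fir_time_s(n_weights):
    """Time per FIR output, assuming four products complete every cycle."""
    return math.ceil(n_weights / MULTIPLIERS) * CYCLE_S


def fft_time_s(n_points):
    """Time per radix-2 FFT, assuming one butterfly per cycle.

    A transform of N points (N a power of two) needs (N/2)*log2(N) butterflies.
    """
    stages = n_points.bit_length() - 1  # log2(N) for a power of two
    return (n_points // 2) * stages * CYCLE_S


fir16 = fir_time_s(16)     # four cycles under these assumptions
fft128 = fft_time_s(128)   # 448 butterflies under these assumptions
```

Under these assumptions a 16-weight FIR output takes four cycles (0.8 microseconds); actual rates depend on how well the data routing keeps the arithmetic elements supplied.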
Figure 4. Proposed radar signal processor arithmetic unit.
III. BASELINE PROCESSING REQUIREMENTS

The baseline processing requirements for this study are provided
by the Advanced Sensors Directorate's Quiet Radar. This is a
phase-coded continuous wave (CW) radar with the following characteristics:

a) Code shift rate = 5 MHz.

b) Code length = 63 bits.

c) Antenna dwell time = 2 msec.

d) Number of code periods per dwell

e) Doppler coverage = ±25 kHz.

f) Number of Doppler lines = 100.

A number of specific Quiet Radar processor configuration alternatives
have been carefully considered by the US Army Missile Research and
Development Command (MIRADCOM) and therefore will not be reiterated here.
The important parameters to note are the code shift rate (5 MHz), the
antenna dwell time (2 msec), and the Doppler cutoff frequency (25 kHz).

Based upon a processing interval of 2 msec and 63 range cells/dwell,
the per range cell computation interval is 31.7 µsec complex or 15.8
µsec per real channel. This computation interval may be compared to the
times required to compute the various FIR and IIR filter orders and FFT
transform lengths discussed in the following sections.
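
The per range cell interval follows directly from the baseline parameters. The short calculation below reproduces it; the function and key names are illustrative, and the code period is a derived quantity (63 bits at 5 MHz) rather than a figure quoted in the text.

```python
def baseline_intervals(dwell_s=2e-3, range_cells=63, code_rate_hz=5e6):
    """Derive timing budgets from the Quiet Radar baseline parameters."""
    per_cell_complex_s = dwell_s / range_cells    # about 31.7 microseconds
    per_cell_real_s = per_cell_complex_s / 2      # about 15.8 microseconds
    code_period_s = range_cells / code_rate_hz    # derived: 63 bits at 5 MHz
    return {
        "per_cell_complex_s": per_cell_complex_s,
        "per_cell_real_s": per_cell_real_s,
        "code_period_s": code_period_s,
    }


budget = baseline_intervals()
```

Any candidate processor must finish its per range cell filtering within the real-channel budget to keep up with the radar in real time.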

IV. REQUIREMENTS VERSUS SIGNAL PROCESSOR
ARCHITECTURE TRADEOFFS

With the Quiet Radar processing requirements as a reference
point, the signal processor architectures described earlier may be
considered. The following sections discuss single processors,
multiprocessors, and the task-allocated signal processor approaches.

A. Single Processors

The single microprocessors considered in this study were
the 8-bit MOS, 4-bit-slice bipolar, and 8-bit-slice CMOS/SOS devices.
To determine their suitability for computing the signal processing
algorithms discussed earlier, i.e., the FIR filter, IIR filter, and FFT, these
algorithms were either encoded and implemented to provide benchmark
timing requirements or, where possible, were taken from the literature.

The Intel 8080A, which has become an industry standard 8-bit
microprocessor, was chosen to provide a baseline against which the other
approaches may be compared. This is a reasonable choice because many versions of
the 8080A exist in the form of high-speed bipolar emulators as well as
software upward compatible, higher speed devices such as the Z-80
microprocessor.
Figure 5 depicts the 8080A baseline throughput capability for three
possible configurations. These configurations are as follows:

1) 8080A with software multiply.

2) 8080A with the AMD AM9511 arithmetic processor chip.

3) 8080A with a high-speed peripheral multiplier such as the TRW
single chip devices.

As could be anticipated, the 8080A with software multiply only is by
far the slowest configuration. For example, this configuration requires
approximately 10 msec to compute a 16-weight FIR filter. The details
of the 8080A configuration and the software used to establish this benchmark
are given in Appendices A and B of this report.

Figure 5 shows that a speed-up of half an order of magnitude can be achieved
using the 8080A augmented with the AMD arithmetic unit (AM9511). The
configuration used is described in detail in Appendix A.

The third 8080A configuration considered was the 8080A coupled with a
high-speed multiplier. This configuration results in a throughput
which is close to an order of magnitude higher than that of the 8080A
with software multiply. It is significant that the increase in
throughput achieved by the 8080A with the high-speed multiplier relative to the
8080A with the AMD device is not as great as expected. The fundamental
reason for this is that as the peripheral devices become faster, the
basic throughput limitation becomes I/O bound. This is true in the 8080A
case even with memory mapped I/O.

It should be pointed out that faster versions of the 8080A would
shift these curves down in a corresponding manner. Additional speed-up
could also be achieved with an expanded instruction set such as that
available with the Z-80. However, as seen later in this discussion, the
overall throughput increase would not be consequential relative to most
coherent radar processing requirements.

Two additional single processor architectures were considered for
reference. The first of these is the Motorola MOD System, which is
basically an 8-bit processor composed of two slices of the M10800
microprocessor [8]. Its performance, assuming the use of Booth's
algorithm to perform the software multiplies, is shown in Figure 5.*
It can be observed that for the FIR algorithm, the throughput can be
increased by a factor of 10 over the baseline 8080A using Booth's
algorithm. However, it is important to recall that this increased
throughput is achieved at the expense of the much higher component parts
count required by bit-slice microprocessors as well as more complex
software.


*Balph, Tom, Motorola, Inc., Phoenix, Arizona, September 1977 (Private
Communication).
Because the Motorola MOD System is basically an 8-bit processor and
the arithmetic precision of interest here is 16 bits, it is reasonable
to exploit the data width expansion capability of bit-slice microprocessor
technology. Raytheon has done this in their "Common Element" approach
processor, which is composed of multiple slices of the M10800 4-bit-slice
microprocessor [9]. In addition to the increased throughput achieved by
doing single precision arithmetic, Raytheon has configured their machine
to do single instruction-per-bit multiplies. (In this case, the multiplies
are 12-bit x 12-bit.) Again, this illustrates the power of
microprogramming coupled with the fast cycle times achievable with bipolar
bit-slice microprocessors.

In  the  case  of  the  FIR  algorithm,  the  Common  Element  processor 
achieved  nearly  three  orders  of  magnitude  increase  in  throughput  relative 
to  the  baseline  8080A  with  software  multiply.  This  is  illustrated  with 
number of filter weights as a parameter in Figure 5.

B. Multiprocessors

An idea of the throughput achievable with arrays of
microprocessors assembled in a multiprocessor configuration can be inferred
from the performance curves of the single processors given in Figure 5.
A simplistic view of this approach is to divide the processing time
for a single processor by the number of microprocessors in the assembled
array. This view, while indeed simplistic, does imply something about
an upper bound on multiprocessor throughput performance.

Continuing with this idea, it could be postulated that ten 8080A's
in a properly configured array could achieve the same throughput as a
single Motorola MOD System as seen from Figure 5. Based upon the same
reasoning, two 8080A's with AM9511 arithmetic processors could also
achieve the same throughput as the Motorola MOD System and probably at a
much lower parts count.

Based  upon  this  same  reasoning,  it  can  be  concluded  that  an  array 
of  nearly  300  8080A's  with  software  multiply  would  be  required  to  achieve 
the  same  throughput  as  a Common  Element  processor.  It  can  be  quickly 
seen  that  the  number  of  slower  processors  required  to  achieve  nearly  the 
same  processing  speed  as  a single,  faster  machine  rapidly  becomes  very 
high.  Consequently,  it  can  be  concluded  that  multiprocessors  composed 
of  low-speed  processing  elements  are  not  likely  to  be  very  efficient 
in  high  data  throughput  applications. 
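
The simplistic scaling argument used above amounts to one division. The sketch below states it explicitly under the same idealization (perfect division of work, with no bus contention or software overhead); the function names are hypothetical, and the 10-msec figure is the 16-weight FIR benchmark quoted for the 8080A with software multiply.

```python
import math


def ideal_array_time_s(single_time_s, n_processors):
    """Best-case processing time when work divides perfectly across the array."""
    return single_time_s / n_processors


def processors_needed(single_time_s, target_time_s):
    """Smallest ideal array that meets a target processing time."""
    return math.ceil(single_time_s / target_time_s)


# Ten baseline 8080A's on the 10-msec FIR benchmark: about 1 msec at best.
ten_array_s = ideal_array_time_s(10e-3, 10)

# Illustrative only: array size if one such FIR output had to be produced
# within a 15.8-microsecond real-channel interval (an assumption, not a
# requirement stated in the text).
array_size = processors_needed(10e-3, 15.8e-6)
```

Because real arrays lose time to bus contention and coordination, these figures are optimistic, which only strengthens the conclusion that arrays of slow processors scale poorly.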

The following section considers a variation of the multiprocessor
approach, where a mixture of high-speed LSI and intelligent
microprocessor logic is used to achieve flexible, high-speed processing.
C. Task-Allocated Processing

The  lower  bound  on  data  throughput  for  purposes  of  this 
study  is  achieved  by  the  special  purpose  radar  processor  described  in 
Section  II, C,  Figure  5 illustrates  that  such  a design  can  potentially 
realize  data  throughput  rates  which  are  four  to  five  orders  of  magnitude 
faster  in  computing  the  FIR  algorithm  than  the  baseline  8080A  processor 
discussed  earlier  (Section  II, A and  Appendices  A and  B) , The  fact  that 
the  proposed  special  purpose  processor  is  approximately  two  orders  of 
magnitude  faster  when  computing  the  FIR  algorithm  than  the  M10800  Common 
Elements  approach  proposed  by  Raytheon  is  illustrated  in  Figure  5, 

Viewed another way, this says that if 100 Common Element processors are
required to achieve the necessary data throughput, a single special
purpose processor can replace all 100 of them.

The central issue here is the obvious tradeoff between processor data
throughput and processor flexibility. However, a great deal of
flexibility can be achieved with a special purpose arithmetic unit through
the use of programmable control. The control structure of the processor
shown in Figure 4 essentially performs a data routing role. Thus, by
making these operations microprogrammable, the special purpose unit can
be made to perform a large number of different algorithms and thereby
overcome processor flexibility limitations. It is proposed that
additional work be undertaken to define specifically the nature of this
control structure. One possibility which should be studied further is the
incorporation of a microprocessor to perform such control functions.

A second potential application for a microprocessor in a
task-allocated structure is as a post-processor. That is, after the
high-speed algorithm processing has been accomplished by the special purpose
LSI (i.e., the multipliers, adders, high-speed memory, etc.), more
sophisticated, but lower throughput, processing is usually required.
For example, Constant False Alarm Rate (CFAR) processing with associated
thresholding and clutter map generation may be accomplished in a
moderate-speed microprocessor. This area is also identified as one
where additional work is needed.

Figure 6 represents the same benchmark approach to processor
comparison illustrated in Figure 5 except that Figure 6 is for the IIR
algorithm. The two figures are similar, but results are given for
different filter orders. A comparison of the FIR and IIR algorithms
shows that the basic memory access required for each is first-in,
first-out (FIFO). In addition, it has already been shown that the
computational elements themselves are similar (Figures 1 through 3).
Therefore, it is reasonable for the benchmark throughput results to be
similar.
Figure 6. IIR filter throughput.
The most computationally demanding coherent radar processing
algorithm is the band-pass Doppler filtering often accomplished using
the FFT algorithm. Figure 7 presents the results obtained using the
signal processor configurations described earlier. The transform lengths
considered were from 8 to 256 points. Earlier work has shown that
realistic transform lengths for the Quiet Radar are from 64 to 256 points.*

It can be observed from Figure 7 that for a 128-point transform,
for example, the special purpose processor is nearly four orders of
magnitude faster than the baseline 8080A processor with software multiply.
As in the case of the FIR and IIR algorithms, the bit-slice processors
fall between these bounds. One additional interesting FFT benchmark
shown in Figure 7 is the RCA 8-bit-slice ATMAC processor with SFU [18].

The throughput achieved with the RCA device is comparable to that
achieved with the Raytheon Common Element approach in the case of the
FFT algorithm. However, this throughput is achieved with a lower parts
count because the ATMAC represents a higher level of circuit integration
(i.e., 8-bit-slice versus 4-bit-slice). Details of this processor are
given in References 6 and 10.


V.  RECOMMENDATIONS  AND  CONCLUSIONS 


The  general  problem  of  realizing  a flexible,  high  throughput 
signal  processor  for  coherent  radar  applications  has  been  considered. 

The approach taken in this study has been to investigate various
microprocessor configurations and to evaluate their capabilities through
the use of benchmark radar signal processing algorithms. As a baseline
configuration, the popular 8080A microprocessor was chosen. Various
configurations of the 8080A with software multiply only and versions of
the 8080A augmented with special peripheral arithmetic hardware were
considered. The signal processing algorithms of interest were programmed
on these machines and their throughput capabilities determined. Both the
AMD AM9511 arithmetic processor and a high-speed peripheral multiplier
were used to augment the basic 8080A.

The ability of bit-slice microprocessors to process the benchmark
coherent radar algorithms was evaluated as a part of this study. The
particular 4-bit microprocessor considered was the fastest cycle time
device currently commercially available, namely, the Motorola M10800.
The two configurations evaluated and compared to the baseline 8080A
processor were the Motorola MOD System and the Raytheon Common Element
processor. Where data were available, the RCA 8-bit ATMAC processor was
also considered.

*Burlage, Don, US Army Missile Research and Development Command, Redstone
Arsenal, Alabama, September 1977 (Private Communication).
Figure 7. FFT throughput.
Finally, a high-speed arithmetic unit comprised of special purpose
LSI circuitry was configured and its throughput capabilities evaluated.
The rational basis for this design was the similarity of the coherent
radar signal processing algorithms of interest. The throughput
performance of the single processors, multiprocessors, and task-allocated
processors were considered with filter order and transform length as
parameters.

The major conclusion of this study is that a carefully configured
combination of high-speed, special purpose LSI coupled with distributed
intelligence in the form of microprocessors can effectively meet the
throughput and flexibility requirements of coherent radars. More
specifically, it has been determined that such an embodiment can meet
the processing needs of the Quiet Radar. It is therefore recommended
that additional work be undertaken to answer remaining performance
questions and that following this effort, such a processor be constructed
and interfaced with the Quiet Radar.
Appendix A. 8080A PROCESSOR BENCHMARK CONFIGURATIONS




The equation solved by the FIR class of digital filters is

    y_n = \sum_{k=0}^{K-1} h_k x_{n-k}                              (A-1)

where h_k are the filter coefficients and x_n are the samples. From this
expression, it is seen that two tables are required: one for storing the
coefficients h_k, and one for storing x_{n-k}. The table for the
coefficients could be ROM or RAM; however, the table for x_{n-k} is RAM
and, as will be shown later, preferably a K-level circulating FIFO. If
RAM is used, then this FIFO is implemented by software.

The coefficients are stored as 16-bit two's-complement numbers. The
samples are 8-bit two's-complement numbers. Hence the need for a 16- by
8-bit multiplier. This multiplier will be implemented presently by
software.

The algorithm necessary for the computation of y_n is straightforward.
A flowchart is given in Figure A-1. The software multiply algorithm
chosen for implementation in the 8080A processor is the well-known
Booth's algorithm. While more efficient mechanizations may be known,
this algorithm is representative. The following discussion briefly
describes Booth's multiplication procedure.

In two's-complement form, X can be represented as

    X = -2^n x_n + \sum_{j=0}^{n-1} 2^j x_j

and Y can be represented as

    Y = -2^m y_m + \sum_{j=0}^{m-1} 2^j y_j

then, with y_{-1} = 0,

    XY = \sum_{j=0}^{m} 2^j (y_{j-1} - y_j) X

Therefore, to multiply X by Y, each pair of adjacent bits of Y is
examined in turn, starting from the least significant:

1) If y_{j-1} = y_j, nothing is added to or subtracted from the
accumulator.

2) If y_{j-1} = 1 and y_j = 0, then X is added to the accumulator.

3) If y_{j-1} = 0 and y_j = 1, then X is subtracted from the accumulator.

An Intel 8080 assembly language routine was written to implement
the algorithm depicted in Figure A-2. In one routine, called MULT, x is
stored in the register pair DE and y is in the 8-bit accumulator, A.
The 16-bit accumulator is the register pair HL, where the result is
obtained. After the multiplication process, the last multiplier bit
examined is saved in the carry bit. Therefore, multiple byte
multiplication is possible. The routine MULT16 multiplies the 16-bit
two's-complement numbers in DE and BC and forms the result in HL. That is,

    HL <- DE x BC.

The seven least significant bits of the result are truncated
using MULT. Therefore, the actual result of the multiplication is

    xy = HL x 2^7 = HL x 128.

The routine was tested and the following results were obtained:

    x = C000 Hex = -4000 Hex = -16384 (decimal)
    y = 55 Hex = +85 (decimal)

    HL = D580H = -2A80H;  HL x 128 = -1392640

    Check: 85 x (-16384) = -1392640

The preceding multiplication required 500 μsec.

    x = C000H = -4000H = -16384
    y = AAH = -56H = -86
    xy = (-86)(-16384) = 1409024
    HL = 2B00H

    HL x 80H = (2B00)H x (80)H = 1409024  (Check)
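The Booth procedure and the two test cases above can be checked with a short sketch. This is a minimal Python illustration, not the 8080 routine itself: the function names are hypothetical, and the partial product is formed by shifting X left rather than rotating the accumulator right as the assembly routine does. The truncation step assumes that HL holds the product with its seven least significant bits dropped, as stated above.

```python
def to_signed(value, bits):
    """Interpret an unsigned bit pattern as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def booth_multiply(x, y_pattern, y_bits=8):
    """Multiply signed x by the two's-complement bit pattern y_pattern."""
    acc = 0
    prev = 0  # y_{-1} is taken as 0
    for j in range(y_bits):
        bit = (y_pattern >> j) & 1
        if prev == 1 and bit == 0:
            acc += x << j   # y_{j-1} = 1, y_j = 0: add X
        elif prev == 0 and bit == 1:
            acc -= x << j   # y_{j-1} = 0, y_j = 1: subtract X
        prev = bit          # equal bits: neither add nor subtract
    return acc


def truncated_hl(product):
    """Drop the seven least significant bits and keep a 16-bit pattern."""
    return (product >> 7) & 0xFFFF


x = to_signed(0xC000, 16)
for y_pattern in (0x55, 0xAA):
    product = booth_multiply(x, y_pattern)
    print(hex(y_pattern), product, hex(truncated_hl(product)))
```

Running the loop reproduces the products and HL patterns of the two test cases quoted in the text.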

To accomplish the data manipulations required by FIR filters, it
is convenient to define a FIFO memory file. The FIFO routine accepts
a sample from the accumulator and stores it at the top of a table, shifting
all samples in the table one location down. The last sample in the
table when FIFO is called is dumped into a garbage collecting location.
The operation of the FIFO is illustrated in Figure A-3.


[Figure A-3: diagram of the FIFO table between TOPFF (top) and the
bottom of the FIFO, before and after a new sample is accepted from ACC.]

Figure A-3. FIFO operation.


The operation is best understood by referring to the source listing
given in Appendix B. It is sufficient to say that to move each sample to
the next lower position, it is first moved from memory into the
accumulator and from there it is moved to the next location. There are two
pointers, HL and DE. HL points to the sample and DE points to the next
location.
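The shifting behavior described above can be sketched in a few lines. This is an illustrative Python model only; the names `fifo_push` and `fifo_cycles` are hypothetical, and the list index plays the role of the HL and DE pointers. The cycle formula is the one quoted for the FIFO routine in the processing time table of this appendix.

```python
def fifo_push(table, sample):
    """Shift every entry one position down and store `sample` at the top.

    table[0] is the top of the FIFO (newest sample) and table[-1] is the
    bottom (oldest sample). Returns the sample that is dumped.
    """
    dumped = table[-1]
    # Work from the bottom upward so no entry is overwritten before it moves.
    for i in range(len(table) - 1, 0, -1):
        table[i] = table[i - 1]
    table[0] = sample
    return dumped


def fifo_cycles(k):
    """Clock cycles quoted in this appendix for a K-level software FIFO."""
    return 88 + 42 * k


table = [3, 2, 1, 0]
print(fifo_push(table, 4), table, fifo_cycles(len(table)))
```

Because every sample is moved on every call, the cost grows linearly with K, which is why the software FIFO contributes a 42 K term to the filter timing.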


The IIR filter may be described by

    y_n = \sum_{k=0} a_k x_{n-k} - \sum_{k=1} b_k y_{n-k}            (A-2)

where a_k and b_k are coefficients and x_n are samples, and both sums
run up to the order of the filter.


To evaluate the preceding expression, it is expanded as

    y_n = a_0 x_n + a_1 x_{n-1} + a_2 x_{n-2} - b_1 y_{n-1} - b_2 y_{n-2}   (A-3)


where a_0, a_1, a_2, b_1, and b_2 are stored in ROM or RAM as 16-bit
two's-complement numbers. The results y_n are also 16-bit numbers; the
samples x_n are 8-bit two's-complement numbers. It is seen that both
16- by 16-bit and 16- by 8-bit multiplication is required. Also, care
must be taken when adding the elements. Scaling must be taken into
account because the results of 16- by 8-bit and 16- by 16-bit
multiplications are added. It is assumed that the coefficients are
scaled such that the elements can be directly summed.


To store x_n, x_{n-1}, and x_{n-2}, three locations called XN0, XN1, and
XN2 are used. When a new sample is given, the sample in XN0 moves
to XN1 and XN1 to XN2 while the new sample is stored in XN0. The third
(oldest) sample is dumped. The IIR routine evaluates the expression
according to the flowchart given in Figure A-4.

On entry to the IIR routine, after initialization by calling IIR1, BC
must contain y_{n-2} and HL, y_{n-1}. This condition is true when the IIR
routine returns to the calling program. Therefore, care must be taken not
to destroy these registers.
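The second-order recursion (A-3) and the XN0/XN1/XN2 bookkeeping can be sketched as follows. This is a hedged Python illustration: the class name `IIRSection` and its attribute names are hypothetical, ordinary Python integers stand in for the 16-bit registers, and the scaling and truncation discussed above are deliberately ignored.

```python
class IIRSection:
    """One second-order section evaluating equation (A-3)."""

    def __init__(self, a0, a1, a2, b1, b2):
        self.a0, self.a1, self.a2 = a0, a1, a2
        self.b1, self.b2 = b1, b2
        self.xn0 = self.xn1 = self.xn2 = 0   # x_n, x_{n-1}, x_{n-2}
        self.yn1 = self.yn2 = 0              # y_{n-1}, y_{n-2}

    def step(self, sample):
        # XN1 moves to XN2, XN0 moves to XN1, the new sample enters XN0.
        self.xn2, self.xn1, self.xn0 = self.xn1, self.xn0, sample
        y = (self.a0 * self.xn0 + self.a1 * self.xn1 + self.a2 * self.xn2
             - self.b1 * self.yn1 - self.b2 * self.yn2)
        # Keep the two most recent outputs for the next call.
        self.yn2, self.yn1 = self.yn1, y
        return y


section = IIRSection(1, 0, 0, -1, 0)   # y_n = x_n + y_{n-1}, a running sum
print([section.step(v) for v in (1, 1, 1)])
```

The state kept between calls (two past outputs) corresponds to the requirement that BC and HL be preserved between calls of the assembly routine.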


1. PROCESSING TIME

The processing time of each routine is given in terms of clock
cycles. For the Intel 8080A standard package, a typical clock cycle is
500 nsec. (K = K' + 1, where K' = order of filter.)

    Routine                    Clock Cycles

    MULT                       MIN = 928, MAX = 1143
    FIFO                       88 + 42 K
    FIR0                       171 + 195 K + MULT x K, where MULT = 1140
    FIR1                       150 + 170 K + MULT x K
    IIR0                       502 + 2 x MULT16 + 3 x MULT, where
                               MULT16 = 2300, MULT = 1140
    IIR1                       427 + 2 x MULT16 + 3 x MULT
    FIR W/AMD 9511, W/O DMA    171 + 388 K
Figure A-4. IIR filter algorithm.
    Routine                    Clock Cycles

    FIR W/AMD 9511, W/DMA      400 + 345 K + FIFO
    IIR W/AMD 9511, W/O DMA    1238 cycles (total)

    Note: the AM9511 requires a 92-clock-cycle multiply + 101 cycles load.

2. FIR, IIR FILTER TIMING

The following are the times in cycles required to process each sample
for K-th order FIR and IIR filters.

a. FIR Filter

    T = 55 + FIFO + (135 + MULT) K                                  (A-4)

where FIFO is a software first-in first-out routine (FIFO = 88 + 42K) and
MULT is the multiplication time. The 8-bit by 16-bit multiplication time
depends on whether it is implemented in software or hardware. The
following table gives the multiplication time in cycles for the
configuration listed.

    Configuration                                   MULT Cycles

    Software 8- by 16-bit                           1140 average
    Memory mapped hardware with hardware            58
      multiplication of (11 cycles or 5.5 μsec)
    9511 APU, memory mapped, no interrupt           172
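Equation (A-4) can be evaluated directly for the three multiplier configurations. The sketch below is a Python illustration under stated assumptions: the function names and the choice K = 8 are hypothetical, and the 500 nsec clock is the typical value quoted in this appendix.

```python
CLOCK_NS = 500  # typical 8080A clock cycle quoted in this appendix

MULT_CYCLES = {
    "software 8x16": 1140,
    "memory mapped hardware": 58,
    "9511 APU, memory mapped": 172,
}


def fifo_cycles(k):
    """Software FIFO routine: 88 + 42 K cycles."""
    return 88 + 42 * k


def fir_cycles(k, mult):
    """Equation (A-4): cycles to process one sample."""
    return 55 + fifo_cycles(k) + (135 + mult) * k


def sample_rate_hz(cycles):
    """Samples per second if each sample costs `cycles` clock cycles."""
    return 1e9 / (cycles * CLOCK_NS)


for name, mult in MULT_CYCLES.items():
    cycles = fir_cycles(8, mult)
    print(f"{name}: {cycles} cycles, {sample_rate_hz(cycles):.1f} samples/s")
```

With software multiply the (135 + MULT) K term dominates, so replacing the multiplier has a much larger effect than any saving in the FIFO routine.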

b. IIR Filter

    T = 82 + [761 + 5 x MULT16]                                      (A-5)

where MULT16 is a 16- by 16-bit multiplication.

    Configuration                         Cycles

    Software (16- x 16-bit)               2300
    Hardware, memory mapped               58
    APU, memory mapped, no interrupt      172
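As with the FIR case, equation (A-5) can be tabulated for each multiplier configuration. This Python sketch evaluates the expression exactly as printed; the function name `iir_cycles` and the microsecond conversion at a 500 nsec clock are illustrative assumptions.

```python
CLOCK_US = 0.5  # 500 nsec clock cycle, expressed in microseconds

MULT16_CYCLES = {
    "software 16x16": 2300,
    "hardware, memory mapped": 58,
    "APU, memory mapped": 172,
}


def iir_cycles(mult16):
    """Equation (A-5) as printed: 82 + [761 + 5 x MULT16]."""
    return 82 + (761 + 5 * mult16)


for name, mult16 in MULT16_CYCLES.items():
    cycles = iir_cycles(mult16)
    print(f"{name}: {cycles} cycles = {cycles * CLOCK_US:.1f} usec")
```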
3. 16- x 16-Bit Multiplication: Hardware Multiplication Alternatives

The purpose of this discussion is to outline briefly the
hardware requirements of a memory mapped 16- x 16-bit hardware
multiplication scheme. In addition, the advantages of memory mapped
versus isolated I/O are explained by comparing the software necessary for
moving data to and from the multiplication unit for each configuration.
Hence, the software required for the operation HL = HL*BC for each
configuration is examined first.

a. Isolated I/O

An isolated I/O configuration is considered where data
are sent to and received from an I/O device through the accumulator.
Figure A-5 shows a block diagram of a 16- by 16-bit hardware
multiplication unit. The I/O ports XL, XH, YL, and YH can be either
isolated I/O ports or memory locations (memory mapped I/O). The necessary
software and the corresponding cycles per instruction required to perform
the operation HL <- HL*BC are given below.

    Cycles/Instruction

       5     MOV   A,C     ;
      10     OUT   XL      ; XL = C
       5     MOV   A,B     ;
      10     OUT   XH      ; XH = B
       5     MOV   A,L     ;
      10     OUT   YL      ; YL = L
       5     MOV   A,H     ;
      10     OUT   YH      ; YH = H, Y = Y*X = HL*BC
      10     IN    YL      ;
       5     MOV   L,A     ;
      10     IN    YH      ; HL = Y
       5     MOV   H,A     ; HL = HL*BC

In the preceding routine, XL is the address of the lower 8-bit latch
of X and XH is the address of the higher 8-bit latch of X. The same
applies to YL and YH. Assuming that the hardware multiplier has a
multiplication time of less than eight cycles (this is the time, in
cycles, between when YH is loaded with H by the instruction OUT YH and
when YL must be put on the data bus during the instruction IN YL), the
program requires 90 cycles of execution. In the light of the preceding
discussion, memory mapped I/O is examined next.
Figure A-5. I/O configuration for isolated or memory mapped hardware multiplication.


b. Memory Mapped I/O

In this case XL, XH, YL, and YH are actually locations in
memory. In moving data to and from these locations, they are treated
as memory locations. Before the software necessary for performing
HL <- HL*BC is presented, a comment should be made about the addresses
XL, XH, YL, and YH. The 8080 microcomputer has convenient instructions
for 16-bit data transfer between memory and the H and L register pair.
These instructions are "SHLD address" and "LHLD address." Using the
LHLD (Load H and L Direct) instruction, the location pointed to by
"address" is loaded into the "L" register. The location in memory
pointed to by "address + 1" is loaded into Register H. Hence, XH must
equal XL + 1. Also, YH = YL + 1.

In the following routine, X = XL and Y = YL. The routine performs
the operation HL <- HL*BC.


    Cycles   Code

      16     SHLD  X       ; X = HL
       5     MOV   L,C     ;
       5     MOV   H,B     ; HL = BC
      16     SHLD  Y       ; Y = HL, Y = Y*X = HL*BC
      16     LHLD  Y       ; HL = HL*BC
      --
      58


The time between when Y is loaded with the contents of HL (when
multiplication starts) and when YL must be present on the data bus, during
LHLD Y, is 11 cycles. The multiplication unit can therefore have a
multiplication time of up to 5.5 μsec.


Isolated and memory mapped I/O are compared in the following table.
The advantages of memory mapped I/O are obvious:

    OPERATION: HL <- HL*BC

                       Software Time*    Hardware Multiplication
    Configuration      in Cycles         Time in Cycles

    Isolated I/O       90                8
    Memory Mapped      58                11

*Software time is the total execution time of the routine.
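The two software times in the table are simply the sums of the per-instruction cycle counts in the two routines. The following Python sketch tallies them; the list names and the helper `total_cycles` are hypothetical, and the cycle counts are those printed beside each instruction in this appendix.

```python
ISOLATED_IO = [
    ("MOV A,C", 5), ("OUT XL", 10), ("MOV A,B", 5), ("OUT XH", 10),
    ("MOV A,L", 5), ("OUT YL", 10), ("MOV A,H", 5), ("OUT YH", 10),
    ("IN YL", 10), ("MOV L,A", 5), ("IN YH", 10), ("MOV H,A", 5),
]

MEMORY_MAPPED = [
    ("SHLD X", 16), ("MOV L,C", 5), ("MOV H,B", 5),
    ("SHLD Y", 16), ("LHLD Y", 16),
]


def total_cycles(listing):
    """Sum the cycle counts of a (mnemonic, cycles) listing."""
    return sum(cycles for _, cycles in listing)


for name, listing in (("isolated", ISOLATED_IO), ("memory mapped", MEMORY_MAPPED)):
    cycles = total_cycles(listing)
    print(f"{name}: {len(listing)} instructions, {cycles} cycles, "
          f"{cycles * 0.5:.1f} usec at 500 nsec per cycle")
```

The saving comes from moving 16 bits per instruction with SHLD and LHLD instead of 8 bits at a time through the accumulator.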
c. Hardware Implementation for Memory Mapped I/O

Figure A-6 presents a basic circuit necessary for storing
the contents of the H and L registers into the latches (or shift registers)
XL and XH using the "SHLD X" instruction. The address X is actually
the address of XL, and X + 1 is the address of XH. For complete grouping
of the hardware requirements of output and input operations, reference
is made to pages 3-8 and 3-9 of the INTEL 8080 Microcomputer Users
Manual for information on memory mapped I/O, and to pages 2-16 and 2-17
for a complete description of the cycles necessary for the execution of
the SHLD and LHLD instructions.


Figure A-7 shows the circuit for input and output to locations
YL and YH. In these circuits, the address X is defined as A15 = 1,
A7 = 1, A0 = 0. The rest of the address bits are "don't cares." For
XH = X + 1: A15 = 1, A7 = 1, and A0 = 1.

The address of Y is defined as A15 = 1, A7 = 0, A0 = 0. The
remaining address lines are "don't cares" (assuming that no other memory
mapped I/O devices are present).
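The partial address decoding described above uses only three address lines. A minimal Python sketch of the decode is given below; the function name `decode` is hypothetical, and the YH case (A15 = 1, A7 = 0, A0 = 1) is an inference from the stated requirement that YH = YL + 1 rather than something the text spells out.

```python
def decode(address):
    """Return the latch selected by a 16-bit address, or None.

    Only A15, A7 and A0 are examined; all other bits are don't cares.
    """
    a15 = (address >> 15) & 1
    a7 = (address >> 7) & 1
    a0 = address & 1
    if a15 == 0:
        return None                      # not a multiplier address
    if a7 == 1:
        return "XH" if a0 else "XL"      # X = XL, X + 1 = XH
    return "YH" if a0 else "YL"          # Y = YL, Y + 1 = YH (inferred)


for addr in (0x8080, 0x8081, 0x8000, 0x8001, 0x0080):
    print(hex(addr), decode(addr))
```

Because thirteen address lines are ignored, each latch responds to many addresses, which is acceptable only under the stated assumption that no other memory mapped devices share the upper half of the address space.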

4. MEMORY MAPPED VERSUS ISOLATED I/O MULTIPLICATION USING THE
AM9511 APU

a. Introduction

The following is a comparison between memory mapped and
isolated I/O configurations of the AM9511 APU in performing the following
operations:

1) Two's-complement 8- by 16-bit multiplication. That is,
HL = DE*A.

2) Two's-complement 16- by 16-bit multiplication. That
is, HL = HL*BC.

b. Hardware Configuration

Figure A-8 shows the hardware configuration of a memory
mapped APU unit. Figure A-9 shows the configuration for an isolated I/O
configuration.

c. Memory Mapped I/O Software, Operation HL = B * HL

The 8-bit two's-complement number in B is multiplied by the
16-bit two's-complement number in HL. The result is placed in HL.
Figure A-6. Memory mapped hardware configuration for latching
H and L from the data bus into the XL and XH latches.

Figure A-7. Hardware configuration for latching H and L and loading
YL and YH into H and L using the instructions SHLD Y, LHLD Y.
    Cycles

      10        LXI   D,APUAD    ; LOAD DE WITH APU ADDRESS
       4        XCHG             ; DE <-> HL or HL = APUAD
       7        MOV   M,E        ;
       7        MOV   M,D        ; TOS = DE
       7        MOV   M,B        ; TOS = B, NOS = DE
       7        MVI   M,0        ;
       5        INX   H          ; POINT TO COMMAND ADDRESS
       7 (92)   MVI   M,SMUL     ; STORE MULTIPLY COMMAND
       5        DCX   H          ; POINT BACK TO DATA ADDRESS
       7        MOV   E,M        ; DE = DE*B
       7        MOV   D,M        ;
       4        XCHG             ; HL = HL*B

Total number of cycles = 169 (77 program, 92 MULT). When APUAD + 1
is put in the address bus, total number of cycles = 164 (72 program,
92 MULT). This method has less code and also is 5 cycles faster than
Method No. 1. APUAD, the address that sets C/D low, and APUCM, the
address that sets C/D high, cannot be consecutive.

d. Isolated I/O Software, Operation HL = HL*B

    Cycles

       5        MOV   A,B        ; MOVE B INTO A
      10        OUT   APUDAT     ;
       4        XRA   A          ; CLEAR A
      10        OUT   APUDAT     ;
       5        MOV   A,L        ;
      10        OUT   APUDAT     ;
       5        MOV   A,H        ;
      10        OUT   APUDAT     ; TOS = HL, NOS = B (16-bit)
       7        MVI   A,SMUL     ; MOVE MULT COMMAND INTO A
      10 (92)   OUT   APUCM      ; SEND IT TO APU, C/D = HIGH
      10        IN    APUDAT     ; MOVE TOS TO HL
       5        MOV   H,A        ; REGISTERS C/D = LOW
      10        IN    APUDAT     ;
       5        MOV   L,A        ; HL = HL*B

Total number of cycles = 198 (106 program, 92 MULT).
Operation HL = HL*BC

The software is the same as for Operation HL = HL*B except that
"XRA A," which was used to clear the accumulator, is replaced by "MOV A,C"
so that TOS = BC. Hence, one cycle is added to the previous program.
Thus, for HL = HL*BC, total cycle time equals 199 (107 program, 92
MULT). The C/D line must go high, indicating that the data on the data
bus are a command (multiplication in this case). When APUAD is put
on the address bus, C/D must be low.

Operation HL = HL*BC

The two's-complement 16-bit numbers in HL and BC are multiplied. The
result is placed in HL in the Memory Mapped Configuration.


    Cycles      Method No. 1

      10        LXI   D,APUAD    ; LOAD DE WITH APU ADDRESS
       4        XCHG             ; DE <-> HL
       7        MOV   M,E        ; MOVE DE (OLD HL) INTO APU
       7        MOV   M,D        ; STACK
       7        MOV   M,C        ; MOVE BC INTO APU STACK
       7        MOV   M,B        ; BC = TOS, DE = NOS
       5        INX   H          ; C/D = HIGH
       7        MVI   M,SMUL     ; SEND MULTIPLICATION COMMAND
       5        DCX   H          ; C/D = LOW
       7        MOV   E,M        ;
       7        MOV   D,M        ;
       4        XCHG             ; DE <-> HL or HL = HL*BC

Number of Cycles = 169 (77 program, 92 MULT).

    Cycles      Method No. 2

      10        LXI   SP,APUDAT  ; LOAD STACK POINTER WITH APU DATA ADDRESS
      11        PUSH  H          ; HL = TOS
      11        PUSH  B          ; BC = TOS, HL = NOS
       2        MVI   A,SMUL     ;
      13        STA   APUCM      ; TOS = NOS*TOS = HL*BC
      10        POP   H          ; HL = TOS REVERSED
       5        MOV   A,L        ; L <-> H
       5        MOV   L,H        ;
       5        MOV   H,A        ; HL = HL*BC
4. SUMMARY

The results obtained for each operation using memory mapped
I/O or isolated I/O are tabulated with the cycle time and necessary
memory storage for each method, as follows:

                                                No. of Bytes
    Operation     Configuration                 Storage Required    Cycles

    HL = HL*B     Memory Mapped, Method No. 1   16                  169
    HL = HL*B     Memory Mapped, Method No. 2   16                  170
    HL = HL*B     Isolated I/O                  22                  198
    HL = HL*BC    Memory Mapped, Method No. 1   15                  169
    HL = HL*BC    Memory Mapped, Method No. 2   14                  164
    HL = HL*BC    Isolated I/O                  22                  199

Because it takes 11 + 10 = 21 (RST + RET) cycles just to service
an interrupt without performing any operation, using the multiplication
feature of the APU with 92 cycles execution time with an interrupt is
unreasonable.
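The interrupt argument above is a ratio of fixed overhead to useful work. The short Python sketch below makes the arithmetic explicit; the constant and function names are hypothetical, and the figures are the RST, RET, and APU multiply cycle counts quoted in this appendix.

```python
APU_MULTIPLY_CYCLES = 92          # APU multiply execution time
INTERRUPT_SERVICE_CYCLES = 11 + 10  # RST + RET, with no work done


def program_cycles(total_cycles):
    """Software cycles left after removing the 92-cycle APU multiply."""
    return total_cycles - APU_MULTIPLY_CYCLES


def interrupt_overhead():
    """Interrupt entry and exit cost as a fraction of the multiply time."""
    return INTERRUPT_SERVICE_CYCLES / APU_MULTIPLY_CYCLES


print(program_cycles(169), program_cycles(198), program_cycles(199))
print(f"interrupt overhead: {interrupt_overhead():.1%} of the multiply time")
```

Since entering and leaving the interrupt costs roughly a quarter of the multiply itself, there is little useful work the processor could do in the remaining time, which supports the no-interrupt configurations tabulated above.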
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liiL 


sesuL  r3 


fCLUSTHAtES  .<Ow 

liaRko^ 


SA.'ICNf 

K 


fSh 

Cu<j 
UMC 

LXI_ 

lxT^ 

SHLO 

LXI 

^ 

STA 

Call 

HLTErf<  CaLL- 


fliflaji- 


0200H 

■^ifABu‘^»~|NH{At:HE"  PniNTF.R^^^'^ 

XPNTR  j SiaRE  X(N)  table  AOOR.  I 
M»yTAt)L  ; INITIALIZE  POINTFB 
yj£fl..,  3. A ■ ' 


N Pf)lNT£R 


STA 

■^}J- 


s£nt 

PIRO 
-£JL8jl 

SC  NT 
A 

SC  NT 
E1LI£R 


I£lli5_r.tNl..TABLL  ADDR.,H4 

.•T^iualize  sample  COUN 
SA'/E  CDUNT  IM  LQC.  scnt 

initialize 


pointer 

r 


3S00H 

I ft  * * * tt  i 

i-A-a. 


check  sample  count 

SAVE  SCNTj  SAMPLE  COUNT 
J'jNP  TO“MiyfjTTFR^~IF”  rMROUGiT^ 


COEF 

COEEPJ- 

E.J  J 

Pm 

nA 

SiH 

OFFiiSH 

aFF7nH 

8S 

Oh 

Da 

0FF76H 

079BH 

OFF7o!( 

OFFTDH 

Oa 

Oh 

OFFoSM 

300000 

Toefe.  are  stored" 


’opeel. 
‘"E'^-dh'rt- 
1 „x 

'lI 


UFO  STORAGE 


OS  IB 
DS  I 

table  mHERE  X 

gs  a 

IQO  - 


samples  are  stored 


DB 

Dd 

OB 

-8^ 

OB 

]l 


vpntr J 
.TTA&kl 


table  where 
DS 
OS 


* — Otfff 


2 

Wo 


samples  are  STIJRED 


t* 

isnrf 

llfo: 


P_  0 U I N E 


Lul 

,’(Vt 

HQVOWN!  >10  V 


TOPpP^K* 

TQPFF-I, 

^TTEnOFF 

R,ENr)FFi 
C,K»l 
AiH 


Ucx 

0 

JllP 

* * * * * 1 

■PJVDNN  1 

mo. 

Of  THE  FtKCI. 


. K 13  NUMHCW  I1F  CLEMENTS 

f^cm  i.ocatiom  is  hOvEO  OO-N 
ro  T'hF.  next  L{Jf:ATlnN  starting  from 
THE  bottom  of  the  FIFO.  HENCE 
XCN)  13  MOVE'T  TO  THE  TOP  ANO 

IS  gUH.PLfl.NOM.  THE. 

30TT0M 


■ROM-  THE-- 


"H  (rrj  TINE 


hUmSER  if4  ACC  BY 

IN  The  register  pai 

REGISTER  PAIR  mI.  fl 


**•••**•***•*********•*•* 

AM  MUl  TIPIES  THE  3 aiT..TRQ'3  COMPLEMENT. 
ME  Itj  BIT  ThiTo  CUHRlEME'IT  NiJMHER 
R DE  ANO  PLACES  The  result  IN  THE 
TRO'S  complement,  registers  DC 


I'lULT! 


H,0300M 

A 


: clear  ii  AN3  l 


MOLl.;  MVl,^ 


H 

Ci3 

enter 


j Clear  Carry  Yf-i)=o 

RN_1  E _M  U L I- 1 P L T r n .1  i-ZE  R 0 H L S 0 0 U 0 - 

; Save  bc  pair  on  stack 
initialize  count 

; start  HULTIPLICATIn. entry  POINT 

; FqR  MUl  IIBYTE  rtUt  T IVy  1C  A T lOM 

; APOXTION:  HL=MLtDE 


AGOlTl OaP 
J f^P 

SUHTR:  M.JV 


0 

count 

A > L 


CHECK  count 

this  is  to 


S TO  SUBTRACT  OE  FROM  HL 


SUBTRACT  0 FROM  H wi’’n  HORRON 

.GEnEti.AJELL  FROM  L-E» ..  — - 

CHECK  COUNT 

IE  not  IHROUGH  then  rotate  ML  TO  THE  RIGHT 
RESTORE  ACC  ANt)  CARR'' 

-BLltilH^BC-jtEGlSTLRS-- 

return  >0  CALLING  PROGRAM 

This  is  to  rotate  hl  to  the  right 
CHECK  H to  see  if  hl  is  neg. 

HSST5Yi,'.2^''T,Wf'5l  « 

rotate  H to  the  right  through  CARRY 


rotate  L through  carry 


. .wjTORE  ACC  AND  Carry 
YU^IRI-.I.'^  CARRY--.  

} EaW^  NO  CARRY  ON  STACK 

; gUBTRACT^OE  FROM  HL  IF  Y(J)xl  AnO  Y(J-n=0 
4-^J^[£_^Y_(J)  = Y(J-1)=0  _ _ 

;3AVc"aCC  AnB  carry, Y(J),  <)N  STACK 

! ADn*DF  TO  ri^'^lF^YC  j}  ’ J*  Y(J  IJ-l 


-CARTTr: RTk-R— 

Push 

JC 

JNP 


PSM 


PS4 

COUNT 

ADOIT 


stack 
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;'T  • pf ‘ r pnpY 

.)  L-w'  i t » , i ki  ki^  i«t.  i 


* THIS ROUTINE MULTIPLIES TWO 16-BIT
* TWO'S COMPLEMENT NUMBERS IN BC AND DE.

• * • 


iuTPUTi 


Ituo: 


miii 


A f 

MULlb 


irPOTR 


T 0 D 
XPNTP 

'•( 

XPNT3 


i ilOW  OE  contains  V(N) 

; load  hl  with  table  pointed 

t rinVE_L£A3T_SIG  BYTE  QE  Yfi)  INTO 

f ILlBLrT  POINT  TO  HOST  Slii.  BTTfc 
; STORE  MOST  SIU,  0>TF.  INTO  MEMORY 

: ST.1WE  table  POINTER  IN  YTaBL^IN  meh. 
; ML  NOW  contains  Y(N) 

*•*••••** 

; LOAO  ML  -w I_T H_x  c N )_t A B L E_ p 0 1 N T E R 

; acCsXCnI^ 

; point  to  X(Ntl) 

; Save  pointer 


F I 9 


digital 
1 


* THE COEFFICIENTS ARE STORED IN A TABLE
* CALLED COEFF AS 16-BIT TWO'S COMPLEMENT
* NUMBERS. X(N-KK) ARE STORED IN LOCATION
* FIFO AS 8-BIT TWO'S COMPLEMENT NUMBERS.
* CALL FIR0 FOR INITIALIZATION.
* CALL FIR1 FOR SUBSEQUENT RESULTS.


ZERO; 


F IRl : 


STARTj 


ZERO 
ATOO 
rlFQ,,, 
m73D0O( 
B.  3 

C»K 

■UrUEfF 

0 - 


POINTS  TO  THE  FIFO 

: K IS  the  number  OE  ADDITIONS  IN  SUM 
ClIiAR  ACC.  THE  FOLLOWING  INITIALIZES 

the  fie'O  so  that  xTn)  for 

! N<q  EQUAL  Ttl.ZERQ ...  

CHECK  COUNT 

"UT  ZtRT)ES  until  END  OF  FIFO 
A=X(N) 

_Pat.JC-(jN).  ON  TOP  OF  FIFO  AND 
;0uh(>x(n-k-i) .clear  Sum 
: B contains  LOWER  ADDRESS  OF  COEFF, 

K IS  the  number  of  samples  summed 
DE  POINrS  TO  THE_T0P...0F,  FIFO 


OUTPUT 


I R^sfoSI  He  ^rom'^stack 

; ACCsX(N-kk) 

■f~SAV^slMPL^''ADDRHs  ON  stack 
JSAVF  TNTERHIDIATE  SUM  ON  STACK 
jMOVf?  COEFF  COUNT  INTO  REG,  L 

j POINT  TO  .host  SIG  byte  OF  COEFF. 

; HOVE  IT  Into  d 

■j'sSi^^CO^Fl^^’^CUUNT^lN  REG.  S 
; MULT  multiplies  DE  BY  ACC  PUTS 
: RESULT  IN  HL,  HL=aCC*UE 
iJL  = X(N-KK)  .hUk) 

I^JT  injermtuiate  s jh  from  stack 
j Into  OE  registers  and  ado  to  hl 

; HL  = SUr1»,<(N-KK)*H(KK)  , OECREHFNT  ( 
_L_ChEC.K  if  K CLEMENTS  SumhEO 
; RFYuRN  ADDRESS  NOw  ON  STACK 
; STORE  Y(N)  IN  table 


COUNT 
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^ S »- 


-4.- 


; THE FOLLOWING PROGRAM CALCULATES Y(N), THE RESPONSE OF A
; K-TH ORDER DIGITAL IIR FILTER TO AN INPUT X(N). THE FILTER
; IS IMPLEMENTED IN SECTIONS. EACH SECTION IS A SECOND-ORDER
; FILTER THAT CALCULATES THE FOLLOWING:
;
;   Y(N)=A0*X(N)+A1*X(N-1)+A2*X(N-2)-B1*Y(N-1)-B2*Y(N-2)
;
; X(N),X(N-1),...,Y(N-1),...,A0,A1,A2,B1,B2 ARE ALL
; 16-BIT TWO'S COMPLEMENT NUMBERS.
;
; SINCE EACH SECTION NEEDS ITS OWN COEFFICIENTS AND PREVIOUS
; VALUES, THE PROGRAM HAS TWO TABLES, THE COEFF. TABLE AND
; THE XY TABLE. IN THE LATTER, THE PREVIOUS VALUES
; X(N-1),X(N-2),... ARE STORED. THE TABLES ARE DIVIDED INTO
; SECTIONS, EACH SECTION CONTAINING COEFF.'S AND PREVIOUS
; VALUES FOR A PARTICULAR FILTER SECTION.
;
; THE CONTENTS OF THE XY TABLE ARE MODIFIED AT THE END OF
; EACH SECTION CALCULATION, THAT IS, X(N) BECOMES X(N-1)
; AND X(N-1) BECOMES X(N-2), ETC. THE XY TABLE IS REFERRED
; TO AS XYTBL. A POINTER CALLED XYPTR POINTS TO THE
; APPROPRIATE SECTION ADDRESS IN THE TABLE AT THE BEGINNING
; OF EACH IIR FILTER SECTION CALCULATION. EACH SECTION IN
; THE TABLE CONTAINS THE FOLLOWING PREVIOUS CONDITIONS IN
; THE ORDER GIVEN: Y(N-2),Y(N-1),X(N-2),X(N-1).
;
; THE COEFFICIENTS FOR EACH SECTION ARE STORED IN THE
; FOLLOWING ORDER: B2,B1,A2,A1,A0.
;
; A SINGLE ROUTINE IS USED TO CALCULATE THE IIR RESPONSE
; FOR EACH SECTION. THIS ROUTINE ASSUMES THAT THE PREVIOUS
; VALUES AND COEFF.'S FOR THAT SECTION HAVE BEEN MOVED FROM
; THE TABLES TO THE FOLLOWING LOCATIONS: Y(N-2) INTO YNM2,
; Y(N-1) INTO YNM1, X(N-2) INTO XNM2, X(N-1) INTO XNM1 AND
; X(N) INTO XNM0, AND THE COEFF. B2 INTO LOC. B2,
; COEFF. B1 INTO LOC. B1, ETC.
;
; THE OUTPUT FROM EACH SECTION IS STORED IN LOCATION XNM0
; TO BE USED AS INPUT TO THE NEXT SECTION.
;
; THE FINAL OUTPUT IS OBTAINED IN THE HL REGISTER PAIR.
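The table layout and section update described above can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the report's routine: the names `to_int16`, `iir_section`, and `iir_cascade` are hypothetical, coefficients are treated as plain integers with no fixed-point scaling (the scaling applied by the original 16-bit multiply is not recoverable from the listing), and the result is simply wrapped to 16 bits.

```python
def to_int16(value):
    """Wrap an integer to 16-bit two's complement, as a register pair would."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def iir_section(x, state, coeffs):
    """Compute one second-order section and update its stored values.

    state  = [y(n-2), y(n-1), x(n-2), x(n-1)]   (XY table order)
    coeffs = [b2, b1, a2, a1, a0]               (coefficient table order)
    """
    yn2, yn1, xn2, xn1 = state
    b2, b1, a2, a1, a0 = coeffs
    y = to_int16(a0 * x + a1 * xn1 + a2 * xn2 - b1 * yn1 - b2 * yn2)
    # x(n) becomes x(n-1) and x(n-1) becomes x(n-2); likewise for y.
    state[:] = [yn1, y, xn1, x]
    return y


def iir_cascade(x, xy_table, cf_table):
    """Run x(n) through every section; each output feeds the next section."""
    for state, coeffs in zip(xy_table, cf_table):
        x = iir_section(x, state, coeffs)
    return x


# Two identical sections with b1 = -1, a0 = 1: each is a running sum.
xy_table = [[0, 0, 0, 0], [0, 0, 0, 0]]
cf_table = [[0, -1, 0, 0, 1], [0, -1, 0, 0, 1]]
outputs = [iir_cascade(x, xy_table, cf_table) for x in [1, 1, 1]]
```

Keeping one state row and one coefficient row per section mirrors the two-table design: the same section routine serves every stage, and only the two row pointers advance.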


IIR:    LXI  H,XYTBL    ; INITIALIZE XY TABLE POINTER, POINTING
        SHLD XYPTR      ; TO PREVIOUS VALUES: X(N-2), ...
        LXI  H,CFTBL    ; INITIALIZE COEFFICIENT TABLE POINTER
        SHLD CFPTR
        LHLD XN         ; HL=X(N) ASSUMING X(N) IN LOC. XN, (A/D MM)
        MVI  A,KD2      ; INITIALIZE SECTION COUNT
SECTN:  STA  SECNT      ; KD2=K/2 IS NUMBER OF SECTIONS, SAVE COUNT
SECTN1:                 ; STORE X(N) INTO LOC. XNM0


* PREV MOVES THE PREVIOUS VALUES Y(N-2),Y(N-1),.. INTO
* LOCATIONS YNM2,YNM1,.. USED BY IIRS TO CALCULATE Y(N)

PREV:   LHLD XYPTR      ; LOAD HL WITH XYPTR WHICH POINTS TO PREVIOUS
        SPHL            ; VALUE SECTION IN XYTBL. MOVE TO STACK POINTER
        MOV  A,L        ; CALCULATE HL=HL+8
        ADI  8
        JNC  NEXT
        INR  H          ; INCREMENT H IF CARRY
NEXT:   MOV  L,A        ; HL=HL+8
        SHLD XYPTR      ; MODIFY POINTER TO POINT TO NEXT SECTION
                        ; XYPTR=XYPTR+8
        POP  H          ; STACK POINTS TO Y(N-2) SO POP H
        SHLD YNM2       ; LOADS Y(N-2) INTO H AND L.
        POP  H          ; POPPING AGAIN LOADS Y(N-1) INTO HL.
        SHLD YNM1       ; STORE Y(N-2),Y(N-1),... AS THEY ARE POPPED
        POP  H          ; INTO LOCATIONS YNM2,YNM1,XNM2,XNM1.
        SHLD XNM2
        POP  H
        SHLD XNM1


* CFMOV MOVES THE COEFFICIENTS FOR THE SECTION TO BE
* CALCULATED FROM CFTBL TO LOCATIONS B2,B1,A2,A1,A0.

CFMOV:  LHLD CFPTR      ; MOVE CFPTR WHICH POINTS TO COEFFICIENTS
        SPHL            ; FOR SECTION TO BE CALCULATED TO STACK POINTER
        MOV  A,L        ; CALCULATE HL=HL+10
        ADI  10
        JNC  NEXT2
        INR  H          ; INCREMENT H IF CARRY
NEXT2:  MOV  L,A
        SHLD CFPTR      ; MODIFY POINTER TO POINT TO NEXT SECTION
                        ; CFPTR=CFPTR+10
        POP  H          ; MOVE B2,B1,... INTO LOCATIONS B2,B1,...
        SHLD B2         ; BY POPPING B2,B1,... INTO HL AND THEN
        POP  H          ; STORING THEM INTO LOCATIONS B2,B1,...
        SHLD B1         ; USED TO CALCULATE IIR RESPONSE.
        POP  H
        SHLD A2
        POP  H
        SHLD A1
        POP  H
        SHLD A0


* THE FOLLOWING CALCULATES Y(N) FOR AN IIR SECTION
* ACCORDING TO THE FOLLOWING:
*
*   Y(N)=A0*X(N)+A1*X(N-1)+A2*X(N-2)-B1*Y(N-1)-B2*Y(N-2)
*
* Y(N-2),Y(N-1),X(N-2),X(N-1), AND X(N) MUST BE IN LOCATIONS
* YNM2,YNM1,XNM2,XNM1, AND XNM0. THE COEFFICIENTS MUST BE
* IN LOCATIONS B2,B1,A2,A1,A0. Y(N) IS CALCULATED AND PUT
* IN THE H AND L REGISTERS. B2,B1,... ARE DESTROYED.


ilflS;* 


PUP 

LMLD 

Call 

XCHtt- 


SPy  B2 F— SXX€K-PaiNTS-TO-B2-WHICH-CUNTAINS-COEF,-B2- 


B 

YNM2 

MUL16 


POP 

m 

XCHC 


8 

YNMl 

Mul.16 


DAO 

XHA 

SUB 

-MO-V 


Vd 

MOV 

LHLD 

ilV: 


A'** 

H»  A 


XCHG 

POP 

-m- 


B 

XNM2 

MUL16 

-D 


OAQ 

POP 

LHLD 

CALfc- 


B 

XNM  I 

■HUL-4«r- 


DAD 


XNMO 
MUL  Efr 


; BC=B2
; HL=Y(N-2)
; HL=HL*BC, HL=Y(N-2)*B2
; BC=B1
; HL=B1*Y(N-1)+B2*Y(N-2)
; CLEAR ACC.
; A=0-L
; CAREFUL NOT TO DESTROY CARRY
; HL=X(N-2)
; HL=X(N-2)*A2
; -B1*Y(N-1)-B2*Y(N-2)
; DE=HL
; BC=A1
; A2*X(N-2)-Y(N-1)*B1-...
; HL=X(N-1)
; HL=A1*X(N-1)
; A1*X(N-1)+...-B1*Y(N-1)-...
; HL=X(N)
; HL=A0*X(N)
; HL=Y(N)=A0*X(N)+A1*X(N-1)+...-B1*Y(N-1)-...
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Aru  «#-•«■» 


; DE=Y(N).
; MOVE XY TABLE POINTER WHICH POINTS TO
; NEXT SECTION TO STACK.
; Y(N) AND Y(N-1) THAT BECOME Y(N-1),Y(N-2)
; THE NEXT TIME AROUND.


; THE-BFe-frew-fN-XYfB^SfJ-ARE 

; Y(N) AND Y(N-1) THAT BECOME Y(N-1),Y(N-2)
; THE NEXT TIME AROUND.
; DECREMENT SECTION COUNT
; CHECK TO SEE IF LAST SECTION.
; EXIT WITH Y(N) IN H AND L REGISTER PAIR


; THE FOLLOWING IS AN 8080 PROGRAM FOR THE FAST FOURIER TRANSFORM.
; THE FFT IS DECIMATION-IN-TIME. THE SAMPLES ARE STORED IN A TABLE
; CALLED XTABL. EACH SAMPLE IS STORED AS A COMPLEX NUMBER WITH
; 16-BIT TWO'S COMPLEMENT REAL AND IMAGINARY PARTS.
;
; REFERENCE IS MADE TO THE FORTRAN DECIMATION-IN-TIME, RADIX 2,
; IN-PLACE FFT ROUTINE BY COOLEY, LEWIS, AND WELCH (PAGE 367,
; RABINER AND GOLD'S "DIGITAL SIGNAL PROCESSING").
;
; THE MAXIMUM NUMBER OF SAMPLES, N, IS NMAX=256.
;
; MOVE MOVES THE REAL AND IMAGINARY PART
; OF THE COMPLEX VARIABLE ADDRESSED BY HL INTO THE LOCATION
; ADDRESSED BY DE.

MOVE:   MVI  C,4        ; INITIALIZE COUNT IN C REGISTER
MOVE1:  MOV  A,M        ; MOVE FIRST BYTE OF REAL PART INTO
        STAX D          ; ACCUMULATOR AND FROM THERE STORE IT
                        ; INTO LOCATION ADDRESSED BY DE.
        INX  H          ; INCREMENT HL AND DE.
        INX  D
        DCR  C          ; DO FOR FOUR BYTES.
        JNZ  MOVE1      ; CHECK TO SEE IF FOUR BYTES MOVED.
        RET


* THE FOLLOWING DOES THE IN-PLACE BIT-REVERSED SHUFFLING.
* REGISTER B CONTAINS THE BIT-REVERSED CODE OF THE COUNT PLACED
* IN REGISTER E. N IS THE NUMBER OF SAMPLES WHICH IS PLACED
* IN REGISTER D.


nP^Sj  I.XI 
MV  I 
XRA 

M'lV 


M,XTABLC» 

D,N 

^ ’ 


HL  POINTS  TO  Table  with  samples  x, 

MO'^E  NUiIBER  of  SAfH>LES,N,  InTo  RFG.  D. 


L,A- 


CLEaR  ACC. 

jlIlEAfl_aEG.. 


E.  _ 


; 


MOV 


a, A 


shffl:  sub  E 

Jl  NONEED 

JC  NUNEED 

RFVFHSF;  PUSH  H 


DAD 

PUSH 


PUSH 
PUSH 
LXI 

•'OV  l.fA 

OaO  h 

dad  h 


; E WILL CONTAIN COUNT AS WE MOVE
; DOWN THE XTABL CONTAINING SAMPLES.
; CLEAR B.
; B WILL CONTAIN BIT REVERSED NUMBER OF E.
; A=B-E. IF B>E THEN REVERSE POSITIONS.
; IF B=E THEN NO NEED TO REVERSE.
; IF E>B THEN ALREADY REVERSED BEFORE.
; SAVE DE REGISTERS.
; SAVE ADDRESS OF CURRENT SAMPLE.
; SAVE TWO LEVELS DEEP ON STACK.
; LOAD DE WITH ADD. OF TEMPORARY LOC.
; MOVE CURRENT SAMPLE INTO TEMP LOCATION.


HL=P*A 

JjU_s'IhA 


PUP 
LXI 
XCHG 
Call  muvl 


H 

0#  Tempr 


POP 

pqp 


Mob^ 

INX  H 

lux tt 


D       ; HL=CURRENT SAMPLE ADD.+4*A.
H       ; HL NOW CONTAINS ADDRESS OF LOC. TO BE
        ; SHUFFLED IN-PLACE. SAVE HL ON STACK
MOVE    ; MOVE SAMPLE IN LOC. ADDRESSED BY HL TO
        ; LOCATION OF CURRENT SAMPLE.
        ; LOAD DE WITH ADDRESS OF TEMP LOC.
        ; SWITCH DE AND HL
        ; MOVE SAMPLE IN LOC. TEMP INTO
        ; LOCATION OBTAINED BY BIT-REVERSAL.
        ; RESTORE ADDRESS OF CURRENT SAMPLE.
        ; RESTORE DE REGISTERS.
        ; INCREMENT COUNTER E TO POINT TO NEXT SAMPLE
        ; BIT-REVERSE E, RETURN WITH BIT-REVERSED
        ; NUMBER IN ACC. PUT IT INTO B.


H 
0 

aiTREV 

a, A 


JNX  H ) 

INX  M ; 

OCR  0 J 

-JjiZ StlEEU. _t- 


; HL=HL+4, HL NOW POINTS TO NEXT SAMPLE IN XTABL.
; CHECK TO SEE IF N SAMPLES SHUFFLED.
; KEEP ON SHUFFLING.
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^ i t\ 




; THE FOLLOWING ROUTINE OBTAINS THE NEXT BIT-REVERSED NUMBER
; "DIGITAL SIGNAL PROCESSING". ON EXIT THE BIT-REVERSED NUMBER
; IS IN THE ACCUMULATOR.
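The shuffle rule given in the comments (swap only when the bit-reversed index B exceeds the count E) can be checked with a short Python model. Note the substitution: the 8080 routine derives each bit-reversed number from the previous one, whereas this sketch reverses the bits of each index directly, which is simpler to verify. The function names are hypothetical.

```python
def bit_reverse(value, bits):
    """Return `value` with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def shuffle_in_place(samples):
    """Reorder `samples` (length a power of two) into bit-reversed order."""
    n = len(samples)
    bits = n.bit_length() - 1          # LN = log2(N)
    for e in range(n):
        b = bit_reverse(e, bits)
        if b > e:                      # B>E: reverse positions
            samples[e], samples[b] = samples[b], samples[e]
        # B=E: no need to reverse; E>B: already reversed before
    return samples


order = shuffle_in_place(list(range(8)))
```

Swapping only when B>E guarantees each pair is exchanged exactly once, so the shuffle needs a single temporary location rather than a second table.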


IITREV 


I MOV 


Rijfx: 

ROTx: 

RuTx: 

-R-KUIN;- 

Rh 


A,B 

C,9-LN 


START 

lRUTN 

C,N/2 


; MOVE PREVIOUS BIT-REVERSED NUMBER
; INTO ACC. THE BIT-REVERSED NUMBER MUST BE
; NORMALIZED. LN=LOG(BASE TWO) OF N.
; HENCE 8-LN ARE THE NUMBER OF TIMES THE
; NUMBER MUST BE SHIFTED TO THE LEFT.
; EXIT CONDITION.
; ROTATE ACCUMULATOR TO THE LEFT
; START BIT-REVERSAL PROCESS
; AS BIT-REVERSED NUMBER.
; BEGIN DENORMALIZATION.
; RETURN
; X-(1-2-4...-N/2) BY CALCULATING
; (1-2-4-...-N/2) AND PUTTING IT IN C


THIS PORTION IMPLEMENTS THE EQUIVALENT OF THE FOLLOWING FORTRAN:

      LE = 2**L
      LE1 = LE/2
      U = (1.0,0.0)
      W = CMPLX(COS(PI/LE1),SIN(PI/LE1))
      DO 20 J = 1,LE1
      DO 10 I = J,N,LE
      IP = I + LE1
      T = A(IP)*U
      A(IP) = A(I) - T
   10 A(I) = A(I) + T
   20 U = U*W

STORAGE LOCATION LL CONTAINS 'L'. REGISTERS D AND E CONTAIN
'J' AND 'I' RESPECTIVELY. LE1 IS KEPT IN REGISTER B AND
LE IS KEPT IN REGISTER C.
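For readers who want to check the loop structure, the following Python sketch follows the same LE, LE1, U, W, I, J, IP, and T roles, using floating-point complex numbers instead of the 16-bit fixed-point values of the 8080 program. It assumes the outer loop over L runs from 1 to M with N = 2**M, and it uses zero-based indices. With the sign of the sine term as written in this listing (positive), the result is the transform with a positive exponent; the function names are hypothetical.

```python
import math


def bit_reverse(value, bits):
    """Return `value` with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def fft_dit(samples):
    """Radix-2, in-place, decimation-in-time FFT on a copy of `samples`."""
    a = [complex(s) for s in samples]
    n = len(a)
    m = n.bit_length() - 1                 # N = 2**M
    for e in range(n):                     # bit-reversed shuffle
        b = bit_reverse(e, m)
        if b > e:
            a[e], a[b] = a[b], a[e]
    for l in range(1, m + 1):
        le = 2 ** l
        le1 = le // 2
        u = complex(1.0, 0.0)
        w = complex(math.cos(math.pi / le1), math.sin(math.pi / le1))
        for j in range(le1):
            for i in range(j, n, le):
                ip = i + le1
                t = a[ip] * u              # T = A(IP)*U
                a[ip] = a[i] - t           # A(IP) = A(I) - T
                a[i] = a[i] + t            # A(I) = A(I) + T
            u = u * w                      # U = U*W
    return a


spectrum = fft_dit([0, 1, 0, 0])
```

Updating U by repeated multiplication avoids a sine and cosine evaluation per butterfly, which matters on a processor with no hardware multiply, at the cost of some accumulated rounding error.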

ALL COMPLEX VALUES HAVE 16-BIT TWO'S COMPLEMENT REAL AND
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■ . - ’ r.  ; I : l 



PO>4tRt 


POWER 

C.A 


H,ZEBO 

JIMAG 


wcalc 
A'  \ 

— 

Ef  A 

H 


Call  auTTPLr 


N 

SKIPl 


; ENTER DO LOOP WITH 'L'=1

-Hiy^^‘T>^.V^c'-?S*JJP?uhitTE  LE  later 
; THIS PORTION CALCULATES
; A=2**L
; THIS IS DONE BY SHIFTING A=1 TO
; THE LEFT 'L' TIMES.
; C=2**L=LE
; LOAD HL WITH REAL 1.0
; STORE HL INTO UREAL
; ACCR=1.0
; LOAD H WITH 0.0
; UIMAG=0.0
; U=(1.0,0.0). ACC=(1.0,0.0)
; CALCULATE W=CMPLX(COS(PI/LE1),SIN(PI/LE1))
; INITIALIZE DO LOOP
; D='J'=1
; E='I'
; 'IP'=A='I'+LE1
; PERFORM BUTTERFLY


3blP?. 


; A='I'+LE
; COMPARE WITH N
; IF 'I'>N EXIT LOOP
; STAY OTHERWISE
; CALCULATE U=U*W
; 'J'='J'+1
; COMPARE 'J' WITH LE1
; EXIT LOOP IF 'J'>LE1
; STAY OTHERWISE
; A='L'

_LLi-i!-LVt.L _ . - 

; COMPARE 'L' WITH 'M'
; RETURN IF 'L'>'M'
; STAY IN LOOP OTHERWISE


* THE FOLLOWING MULTIPLIES THE COMPLEX NUMBER ADDRESSED BY HL
* (CALLED X=A+B*SQRT(-1)) BY Y=AP+BP*SQRT(-1) ADDRESSED BY DE.
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f'M 



.►UJLI: 


Push 

TiUc 

I NX 
HUV 

WL^ 


iti 


-rnTT 

InX 

MOV 


:a 

MOV 

SOS 

JUIV. 


MMV 

3BB 

MOV 

XC^G 


LHLO 

XCMG 

SmL3 

-POi! 

MOV 
MOV 
I NX 
-UVX- 


POSH 

MOV 

INX 

JOV- 


XCHG 

C^LL 

XCH6 

i.Mi  n 


POP 
Call 
0*0 
-SML 


_d_ 


C,M 

M 

H.M 
J 


H 

*CCP 
MOL  1 6 


B,M 


Htf — ftSH 


-L*4- 


*. 

H, 


ACCR 

*CCR 
_H.. 


C,M 

0.H 

H 

_U 


B 

C,M 

H 


MOL  16 
Ar.CI 


OF  X. 


miTh  REAL  Part 


“5X71 

POT 


'iflMAG. 

ieal  part 


ADDRESS 
OF  Y IN 


IN  DE 
ML,  HL»AP 


XL 


BC  REGISTERS.  BCsH 

4L=fle_(lHAG.  PART  QF. 

HLaHt*OC  UR  hL=B*HP 


START  subtracting  j Ul=DE- 
XMAX_L3_HLaA*AP-B*BP  


ML 


X -START  Pitting, real_p*hT 


OF  Y 

heal 


before  St.jRING  COMPUTED 

In  accumulator, 
store  A»AP-B«0P  into  complex 

HLSIURE.  X REAL  PART  ADDRESS 


IN  Pt, 

PART 

acc.  real  part 


; MOVE X REAL PART, A, INTO BC REGISTERS
; HL NOW POINTS TO IMAG. PART OF X, B.
; SAVE REAL PART OF X, A, ON STACK
; MOVE IMAG PART OF X, B, INTO BC REGISTERS.


3 

MOL  16 
D 


; DE, WHICH CONTAINED AP, IS MOVED TO HL.
; HL=HL*BC, OR HL=AP*B.
; DE=AP*B.
; HL=BP.
; RESTORE X REAL PART, A, INTO BC REGISTERS.
; HL=HL*BC, OR HL=A*BP
; HL=HL+DE, OR HL=A*BP+AP*B.

; ACCUMULATOR=X*Y=(A*AP-B*BP)+(A*BP+AP*B)*I
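The product formed by this routine uses four real multiplications and two additions. The Python sketch below writes the same expression with the listing's names A, B, AP, and BP; the function name is hypothetical, and plain integers stand in for the 16-bit values, so the overflow and scaling behavior of the 8080 code is not modeled.

```python
def complex_multiply(a, b, ap, bp):
    """Return (real, imag) of (a + b*i) * (ap + bp*i)."""
    real = a * ap - b * bp     # A*AP - B*BP
    imag = a * bp + ap * b     # A*BP + AP*B
    return real, imag


product = complex_multiply(1, 2, 3, 4)   # (1+2i)*(3+4i)
```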


* THE FOLLOWING CALCULATES:
*
*   A(IP)=A(I)-T
*   A(I)=A(I)+T
*
* FIRST THE ADDRESSES OF A(IP) AND A(I) ARE OBTAINED FROM
* REGISTERS A=IP AND E=I RESPECTIVELY. BC WILL POINT TO A(I)
* AND DE WILL POINT TO A(IP).
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^ rnpY 

* 


rTFLVl 


; PUT E INTO L AND 0 IN H, HL=E
; LOAD DE WITH XTABL
; HL=2*HL, OR HL=2*E
; HL=2*HL, OR HL=4*E
; HL=XTABL+4*E

-fih^asos'iuxn  — 

; HL=2*HL, OR HL=2*A
; HL=2*HL, OR HL=4*A
; HL=XTABL+4*A
; DE NOW POINTS TO A(IP)


* DE POINTS TO A(IP), BC POINTS TO A(I). T IS IN THE COMPLEX
* ACCUMULATOR IN MEMORY WHICH IS ADDRESSED BY HL.

; SAVE A(IP) ADDRESS ON STACK
; MOVE U INTO ACC.
; MULTIPLY A(IP) BY U. ACC=A(IP)*U
; RESTORE ADDRESS OF A(IP)
; SAVE ADDRESS OF A(I).
; START CALCULATING A(IP)=A(I)-T.
; T IN THE ACCUMULATOR IS SUBTRACTED FROM
; THE REAL PART OF A(I) POINTED TO BY BC
; AND THE RESULT IS STORED IN THE REAL
; PART OF A(IP) POINTED TO BY DE.
; THE SAME IS DONE FOR THE IMAGINARY PART.
; RESTORE ADDRESS OF A(I)
; LOAD HL WITH REAL PART OF T
; START CALCULATING A(I)=A(I)+T
; STORE BACK INTO A(I)
; DO FOR IMAGINARY PART ALSO
; RESTORE ALL REGISTERS
; RETURN
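The butterfly and its address arithmetic can be summarized in a short Python model. It assumes, as the comments indicate, that each complex sample occupies four bytes (two for the real part, two for the imaginary part), so the sample at a given index lies at XTABL+4*index. The function names and the example base address are hypothetical, and Python complex numbers replace the 16-bit fixed-point pairs.

```python
def sample_address(xtabl, index):
    """Byte address of a complex sample: 4 bytes per sample."""
    return xtabl + 4 * index


def butterfly(table, i, ip, u):
    """Apply T = A(IP)*U, A(IP) = A(I) - T, A(I) = A(I) + T in place."""
    t = table[ip] * u
    table[ip] = table[i] - t
    table[i] = table[i] + t


table = [1 + 0j, 2 + 0j]
butterfly(table, 0, 1, 1j)
```

Computing A(IP) before A(I) matters: both results need the old value of A(I), so the subtraction must be stored first and the sum last.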
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